Source file ⇒ The_Analytics_Edge_edX_MIT15.071x_June2015_4.rmd

Unit 5: Text Analytics

Preliminaries

NOTE: I have included some summary outputs that I could have left out, but the main intention was to check that each function was performing as desired, since there were some issues related to differences between the tm package version used in the lecture and the latest version.

Turning Tweets into Knowledge_An Introduction to Text Analytics_1

INTRODUCTION

We will be trying to understand sentiment of tweets about the company Apple.

While Apple has a large number of fans, they also have a large number of people who don’t like their products. They also have several competitors.
To better understand public perception, Apple wants to monitor how people feel over time and how people receive new announcements.

Our challenge in this lecture is to see if we can correctly classify tweets as being negative, positive, or neither about Apple.

The Data

To collect the data needed for this task, we had to perform two steps.

  • Collect Twitter data

The first was to collect data about tweets from the internet.
Twitter data is publicly available, and it can be collected through scraping the website or via the Twitter API.

The sender of the tweet might be useful to predict sentiment, but we will ignore it to keep our data anonymized.
So we will just be using the text of the tweet.

  • Construct the outcome variable

Then we need to construct the outcome variable for these tweets, which means that we have to label them as positive, negative, or neutral sentiment.

We would like to label thousands of tweets, and we know that two people might disagree over the correct classification of a tweet. To do this efficiently, one option is to use the Amazon Mechanical Turk.

The task that we put on the Amazon Mechanical Turk was to judge the sentiment expressed by the following item toward the software company Apple.
The items we gave them were tweets that we had collected. The workers could pick from the following options as their response:

  • strongly negative,
  • negative,
  • neutral,
  • positive, and
  • strongly positive.

These outcomes were represented as a number on the scale from -2 to 2.

Each tweet was labeled by five workers. For each tweet, we take the average of the five scores given by the five workers, hence the final scores can range from -2 to 2 in increments of 0.2.

The following graph shows the distribution of the number of tweets classified into each of the categories. We can see here that the majority of tweets were classified as neutral, with a small number classified as strongly negative or strongly positive.

[Figure: distribution of sentiment scores]

So now we have a bunch of tweets that are labeled with their sentiment. But how do we build independent variables from the text of a tweet to be used to predict the sentiment?

A Bag of Words

One of the most commonly used techniques for transforming text into independent variables is called Bag of Words.

Fully understanding text is difficult, but Bag of Words provides a very simple approach: it just counts the number of times each word appears in the text and uses these counts as the independent variables.

For example, in the sentence,

"This course is great.  I would recommend this course to my friends,"

the word this is seen twice, the word course is seen twice, the word great is seen once, et cetera.
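The counting step above can be sketched in a few lines of base R (a toy illustration only; the tm package used later in this unit automates all of this):

```r
# Toy bag-of-words: split a sentence into words and count occurrences
sentence <- "This course is great. I would recommend this course to my friends"
words <- tolower(unlist(strsplit(sentence, "[^A-Za-z]+")))  # split on non-letters
counts <- table(words)
counts[["this"]]    # 2
counts[["course"]]  # 2
counts[["great"]]   # 1
```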

[Figure: bag of words word counts]

In Bag of Words, there is one feature for each word. This is a very simple approach, but is often very effective, too. It is used as a baseline in text analytics projects and for Natural Language Processing.

This is not the whole story, though. Preprocessing the text can dramatically improve the performance of the Bag of Words method.

Cleaning Up Irregularities

One part of preprocessing the text is to clean up irregularities.
Text data often has many inconsistencies that will cause algorithms trouble, since computers are very literal by default.

  • One common irregularity concerns the case of the letters, and it is customary to change all words to either lower-case or upper-case.

  • Punctuation also causes problems, and the basic approach is to remove everything that is not a letter. However some punctuation is meaningful, and therefore the removal of punctuation should be tailored to the specific problem.
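As an illustration of tailoring, tweets are a case where some punctuation carries meaning: @ marks a user and # marks a hashtag. A minimal base-R sketch (not the tm workflow used later, which removes all punctuation) could keep those two symbols:

```r
# Strip punctuation but keep the Twitter-meaningful @ and # (toy sketch)
tweet <- "iOS 7 is so fricking smooth & beautiful!! #ThanxApple @Apple"
cleaned <- gsub("[^A-Za-z0-9@# ]", "", tweet)
cleaned  # "iOS 7 is so fricking smooth  beautiful #ThanxApple @Apple"
```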

There are also unhelpful terms:

  • Stopwords: words that are used frequently but carry little meaning on their own. Examples are the, is, at, and which. It’s unlikely that these words will improve the machine learning prediction quality, so we want to remove them to reduce the size of the data.
    • There are some potential problems with this approach. Sometimes, two stop words taken together have a very important meaning (e.g. the name of the band “The Who”). By removing the stop words, we remove both of these words, but The Who might actually have a significant meaning for our prediction task.
  • Stemming: This step is motivated by the desire to represent words with different endings as the same word. We probably do not need to draw a distinction between argue, argued, argues, and arguing. They could all be represented by a common stem, argu. The algorithmic process of performing this reduction is called stemming.
    There are many ways to approach the problem.

    1. One approach is to build a database of words and their stems.
      • A pro is that this approach handles exceptions very nicely, since we have defined all of the stems.
      • However, it will not handle new words at all, since they are not in the database.
        This is especially bad for problems where we’re using data from the internet, since we have no idea what words will be used.
    2. A different approach is to write a rule-based algorithm.
      In this approach, if a word ends in things like ed, ing, or ly, we would remove the ending.
      • A pro of this approach is that it handles new or unknown words well.
      • However, there are many exceptions, and this approach would miss all of these.
        Words like child and children would be considered different, but it would get other plurals, like dog and dogs.

    This second approach is widely popular and is called the Porter Stemmer, designed by Martin Porter in 1980, and it’s still used today.
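A toy version of the rule-based idea can be written in a few lines of base R. Here toy_stem is a hypothetical helper for illustration, far cruder than the real Porter Stemmer (which applies many ordered rules with conditions):

```r
# Toy rule-based stemmer: strip a few common English suffixes in one pass
toy_stem <- function(word) {
  sub("(ed|ing|ly|es|s)$", "", word)  # no exception handling
}
toy_stem(c("argued", "arguing", "argues"))  # "argu" "argu" "argu"
toy_stem("dogs")      # "dog"  -- regular plurals are handled
toy_stem("children")  # "children" -- irregular forms are missed
```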

VIDEO 2: TEXT ANALYTICS

QUICK QUESTION

Which of these problems is the LEAST likely to be a good application of natural language processing?

Ans:Judging the winner of a poetry contest

EXPLANATION:Judging the winner of a poetry contest requires a deep level of human understanding and emotion. Perhaps someday a computer will be able to accurately judge the winner of a poetry contest, but currently the other three tasks are much better suited for natural language processing.

VIDEO 3: CREATING THE DATASET

QUICK QUESTION

For each tweet, we computed an overall score by averaging all five scores assigned by the Amazon Mechanical Turk workers. However, Amazon Mechanical Turk workers might make significant mistakes when labeling a tweet. The mean could be highly affected by this.

Which of the three alternative metrics below would best capture the typical opinion of the five Amazon Mechanical Turk workers, would be less affected by mistakes, and is well-defined regardless of the five labels?

Ans:An overall score equal to the median (middle) score

EXPLANATION:The correct answer is the first one - the median would capture the typical opinion of the workers and tends to be less affected by significant mistakes. The majority score might not have given a score to all tweets because they might not all have a majority score (consider a tweet with scores 0, 0, 1, 1, and 2). The minimum score does not necessarily capture the typical opinion and could be highly affected by mistakes (consider a tweet with scores -2, 1, 1, 1, 1).
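A quick numerical check of the explanation above (scores are on the -2 to 2 scale):

```r
# One worker mislabels an otherwise positive tweet with -2
scores <- c(-2, 1, 1, 1, 1)
mean(scores)    # 0.4 -- dragged toward the mistake
median(scores)  # 1   -- still reflects the typical opinion
```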

VIDEO 4: BAG OF WORDS

QUICK QUESTION

For each of the following questions, pick the preprocessing task that we discussed in the previous video that would change the sentence “Data is useful AND powerful!” to the new sentence listed in the question.

New sentence: Data useful powerful!

Ans:Removing stop words

New sentence: data is useful and powerful

Ans:Cleaning up irregularities (changing to lowercase and removing punctuation)

New sentence: Data is use AND power!

Ans:Stemming

EXPLANATION:The first new sentence has the stop words “is” and “and” removed. The second new sentence has the irregularities removed (no capital letters or punctuation). The third new sentence has the words stemmed - the “ful” is removed from “useful” and “powerful”.

VIDEO 5: PRE-PROCESSING IN R (R script reproduced here)

Sys.setlocale("LC_ALL", "C")
## [1] "C"
# Unit 5 - Twitter

# VIDEO 5

#LOADING AND PROCESSING DATA IN R
tweets = read.csv("tweets.csv", stringsAsFactors=FALSE) 
#Note: when working on a text analytics problem it is important (necessary!) to add the extra argument stringsAsFactors = FALSE, so that the text is read in properly.

#Let's take a look at the structure of our data:
str(tweets)
## 'data.frame':    1181 obs. of  2 variables:
##  $ Tweet: chr  "I have to say, Apple has by far the best customer care service I have ever received! @Apple @AppStore" "iOS 7 is so fricking smooth & beautiful!! #ThanxApple @Apple" "LOVE U @APPLE" "Thank you @apple, loving my new iPhone 5S!!!!!  #apple #iphone5S pic.twitter.com/XmHJCU4pcb" ...
##  $ Avg  : num  2 2 1.8 1.8 1.8 1.8 1.8 1.6 1.6 1.6 ...
#We have 1181 observations of 2 variables:
##Tweet: the text of the tweet.
##Avg: the average sentiment score.

#The tweet texts are real tweets that we gathered on the internet directed to Apple, with a few words cleaned up. We are more interested in being able to detect the tweets with clear negative sentiment, so let's define a new variable in our data set called Negative.

#equal to TRUE if the average sentiment score is less than or equal to -1
#equal to FALSE if the average sentiment score is greater than -1.

# Create dependent variable
tweets$Negative = as.factor(tweets$Avg <= -1)

table(tweets$Negative)
## 
## FALSE  TRUE 
##   999   182
#Now, to pre-process our text data so that we can use the 'Bag of Words' approach, we will be using the 'tm' text mining package.

#install.packages("tm")
library(tm)
#install.packages("SnowballC")
library(SnowballC)


#One of the concepts introduced by the tm package is that of a corpus. A corpus is a collection of documents. We need to convert our tweets into a corpus for pre-processing.

#Various functions in the tm package can be used to create a corpus in many different ways. We will create it from the Tweet column of our data frame using two functions, Corpus() and VectorSource(). We feed the latter the Tweet variable of the tweets data frame.

# Create corpus
corpus = Corpus(VectorSource(tweets$Tweet))

# Look at corpus
corpus
## <<VCorpus>>
## Metadata:  corpus specific: 0, document level (indexed): 0
## Content:  documents: 1181
#We can check that the documents match our tweets by using double brackets [[.
#To inspect the first (or 10th) tweet in our corpus, we select the first (or 10th) element as:
attributes(corpus[[1]])
## $names
## [1] "content" "meta"   
## 
## $class
## [1] "PlainTextDocument" "TextDocument"
corpus[[1]]$content
## [1] "I have to say, Apple has by far the best customer care service I have ever received! @Apple @AppStore"
corpus[[10]]$content
## [1] "Just checked out the specs on the new iOS 7...wow is all I have to say! I can't wait to get the new update ?? Bravo @Apple"
# IMPORTANT NOTE: If you are using the latest version of the tm package, you will need to run the following line before continuing (it converts corpus to a Plain Text Document). This is a recent change having to do with the tolower function that occurred after this video was recorded.
corpus = tm_map(corpus, PlainTextDocument)


#Converting text to lower case

#Pre-processing is easy in tm.
#Each operation, like stemming or removing stop words, can be done with one line in R, where we use the tm_map() function which takes as its first argument the name of a corpus and as second argument a function performing the transformation that we want to apply to the text.

#To transform all text to lower case:
corpus = tm_map(corpus, content_transformer(tolower))

#Checking the same two "documents" as before:
corpus[[1]]$content
## [1] "i have to say, apple has by far the best customer care service i have ever received! @apple @appstore"
corpus[[10]]$content
## [1] "just checked out the specs on the new ios 7...wow is all i have to say! i can't wait to get the new update ?? bravo @apple"
# Removing punctuation
corpus = tm_map(corpus, removePunctuation)
corpus[[1]]$content
## [1] "i have to say apple has by far the best customer care service i have ever received apple appstore"
corpus[[10]]$content
## [1] "just checked out the specs on the new ios 7wow is all i have to say i cant wait to get the new update  bravo apple"
# Look at the stop words provided by the tm package. It is necessary to define a list of words that we regard as stop words, and for this the tm package provides a default list for the English language. We can check it out with:
stopwords("english")[1:10]
##  [1] "i"         "me"        "my"        "myself"    "we"       
##  [6] "our"       "ours"      "ourselves" "you"       "your"
length(stopwords("english"))
## [1] 174
#Next we want to remove the stop words in our tweets.
#Removing words can be done with the removeWords argument to the tm_map() function, with an extra argument, i.e. what the stop words are that we want to remove.
#We will remove all of these English stop words, but we will also remove the word "apple" since all of these tweets have the word "apple" and it probably won't be very useful in our prediction problem.

# Removing stopwords and apple
corpus = tm_map(corpus, removeWords, c("apple", stopwords("english")))
corpus[[1]]$content
## [1] "   say    far  best customer care service   ever received  appstore"
corpus[[10]]$content
## [1] "just checked   specs   new ios 7wow      say  cant wait  get  new update  bravo "
#Stemming

#Lastly, we want to stem our document with the stemDocument argument.

# Stem document 
corpus = tm_map(corpus, stemDocument)
corpus[[1]]$content
## [1] "   say    far  best custom care servic   ever receiv  appstor"
corpus[[10]]$content
## [1] "just check   spec   new io 7wow      say  cant wait  get  new updat  bravo"
#We can see that this took off the ending of "customer," "service," "received," and "appstore."

##################################

#QUICK QUESTION  

#Q:Given a corpus in R, how many commands do you need to run in R to clean up the irregularities (removing capital letters and punctuation)?
#Ans:2

#Q:How many commands do you need to run to stem the document?
#Ans:1

#EXPLANATION:In R, you can clean up the irregularities with two lines:
#corpus = tm_map(corpus, tolower)
#corpus = tm_map(corpus, removePunctuation)
#And you can stem the document with one line:
#corpus = tm_map(corpus, stemDocument)

VIDEO 6: BAG OF WORDS IN R (R script reproduced here)

# Video 6

#Create a Document Term Matrix

#We are now ready to extract the word frequencies to be used in our prediction problem. The tm package provides a function called DocumentTermMatrix() that generates a matrix where:
#the rows correspond to documents, in our case tweets, and
#the columns correspond to words in those tweets.
#The values in the matrix are the number of times that word appears in each document.

corpus = tm_map(corpus, PlainTextDocument)

# Create matrix
frequencies=DocumentTermMatrix(corpus)

frequencies
## <<DocumentTermMatrix (documents: 1181, terms: 3289)>>
## Non-/sparse entries: 8980/3875329
## Sparsity           : 100%
## Maximal term length: 115
## Weighting          : term frequency (tf)
#We see that in the corpus there are 3289 unique words.

#Let's see what this matrix looks like using the inspect() function, in particular slicing a block of rows/columns from the Document Term Matrix by calling by their indices:

# Look at matrix 
inspect(frequencies[1000:1005,505:515])
## <<DocumentTermMatrix (documents: 6, terms: 11)>>
## Non-/sparse entries: 1/65
## Sparsity           : 98%
## Maximal term length: 9
## Weighting          : term frequency (tf)
## 
##               Terms
## Docs           cheapen cheaper check cheep cheer cheerio cherylcol chief
##   character(0)       0       0     0     0     0       0         0     0
##   character(0)       0       0     0     0     0       0         0     0
##   character(0)       0       0     0     0     0       0         0     0
##   character(0)       0       0     0     0     0       0         0     0
##   character(0)       0       0     0     0     0       0         0     0
##   character(0)       0       0     0     0     1       0         0     0
##               Terms
## Docs           chiiiiqu child children
##   character(0)        0     0        0
##   character(0)        0     0        0
##   character(0)        0     0        0
##   character(0)        0     0        0
##   character(0)        0     0        0
##   character(0)        0     0        0
#In this range we see that the word "cheer" appears in tweet 1005, but "cheap" does not appear in any of these tweets. This data is what we call sparse, meaning that there are many zeros in our matrix.

#We can look at what the most popular terms are, or words, with the function findFreqTerms(), selecting a minimum number of 20 occurrences over the whole corpus:
# Check for sparsity
findFreqTerms(frequencies, lowfreq=20)
##  [1] "android"              "anyon"                "app"                 
##  [4] "appl"                 "back"                 "batteri"             
##  [7] "better"               "buy"                  "can"                 
## [10] "cant"                 "come"                 "dont"                
## [13] "fingerprint"          "freak"                "get"                 
## [16] "googl"                "ios7"                 "ipad"                
## [19] "iphon"                "iphone5"              "iphone5c"            
## [22] "ipod"                 "ipodplayerpromo"      "itun"                
## [25] "just"                 "like"                 "lol"                 
## [28] "look"                 "love"                 "make"                
## [31] "market"               "microsoft"            "need"                
## [34] "new"                  "now"                  "one"                 
## [37] "phone"                "pleas"                "promo"               
## [40] "promoipodplayerpromo" "realli"               "releas"              
## [43] "samsung"              "say"                  "store"               
## [46] "thank"                "think"                "time"                
## [49] "twitter"              "updat"                "use"                 
## [52] "via"                  "want"                 "well"                
## [55] "will"                 "work"
#Out of the 3289 words in our matrix, only 56 words appear at least 20 times in our tweets.
#This means that we probably have a lot of terms that will be pretty useless for our prediction model. The number of terms is an issue for two main reasons:
#One is computational: more terms means more independent variables, which usually means it takes longer to build our models.
#The other is that in building models the ratio of independent variables to observations will affect how well the model will generalize.

# Remove sparse terms(removing some terms that don't appear very often.)
sparse = removeSparseTerms(frequencies, 0.995)

#This function takes a second parameter, the sparsity threshold, which works as follows.
#If we say 0.98, this means to only keep terms that appear in 2% or more of the tweets.
#If we say 0.99, that means to only keep terms that appear in 1% or more of the tweets.
#If we say 0.995, that means to only keep terms that appear in 0.5% or more of the tweets, about six or more tweets.
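As a sanity check of the arithmetic above, with 1181 tweets the 0.995 threshold keeps terms appearing in at least 0.5% of documents:

```r
# Minimum number of tweets a term must appear in to survive the 0.995 threshold
ceiling(1181 * (1 - 0.995))  # 6
```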

#Let's see what the new Document Term Matrix properties look like:
sparse
## <<DocumentTermMatrix (documents: 1181, terms: 309)>>
## Non-/sparse entries: 4669/360260
## Sparsity           : 99%
## Maximal term length: 20
## Weighting          : term frequency (tf)
#It only contains 309 unique terms, i.e. only about 9.4% of the full set.

# Convert sparse to a data frame to use for predictive modeling
tweetsSparse = as.data.frame(as.matrix(sparse))


#Fix variables names in the data frame

#Since R struggles with variable names that start with a number, and we probably have some words here that start with a number, we should run the make.names() function to convert the column names into appropriate variable names for R before we build our predictive models. You should do this each time you build a data frame using text analytics.

# Make all variable names R-friendly
colnames(tweetsSparse) = make.names(colnames(tweetsSparse))

# Add dependent variable
#We should add our dependent variable back to this data frame. We'll call it tweetsSparse$Negative and set it equal to the original Negative variable from the tweets data frame.
tweetsSparse$Negative = tweets$Negative


# Split the data in training/testing sets
library(caTools)

set.seed(123)

split = sample.split(tweetsSparse$Negative, SplitRatio = 0.7)

trainSparse = subset(tweetsSparse, split==TRUE)
testSparse = subset(tweetsSparse, split==FALSE)

#QUICK QUESTION  

#In the previous video, we showed a list of all words that appear at least 20 times in our tweets. Which of the following words appear at least 100 times? Select all that apply. (HINT: use the findFreqTerms function)
findFreqTerms(frequencies, lowfreq=100)
## [1] "iphon" "itun"  "new"
#Ans:"iphon", "itun", and "new"

VIDEO 7: PREDICTING SENTIMENT (R script reproduced here)

# Video 7

# Build a CART model

library(rpart)
library(rpart.plot)


#Let's first use CART to build a predictive model, using the rpart() function to predict Negative using all of the other variables as our independent variables and the data set trainSparse.

#We'll add one more argument here, which is method = "class" so that the rpart() function knows to build a classification model. We keep default settings for all other parameters, in particular we are not adding anything for minbucket or cp.
#Building the classification model with all the IVs 
tweetCART = rpart(Negative ~ ., data=trainSparse, method="class")

#plotting the tree
prp(tweetCART)

#The tree says that
#if the word "freak" is in the tweet, then predict TRUE, or negative sentiment.
#If the word "freak" is not in the tweet but the word "hate" is, again predict TRUE.
#If neither of these two words are in the tweet, but the word "wtf" is, also predict TRUE, or negative sentiment.
#If none of these three words are in the tweet, then predict FALSE, or non-negative sentiment.

#This tree makes sense intuitively since these three words are generally seen as negative words.


# Evaluate the out-of-sample performance of the model to get class predictions
#Using the predict() function we compute the predictions of our model tweetCART on the new data set testSparse. Be careful to add the argument type = "class" to make sure we get class predictions.
predictCART = predict(tweetCART, newdata=testSparse, type="class")

#computing the confusion matrix from the predictions
cmat_CART<-table(testSparse$Negative, predictCART)
cmat_CART
##        predictCART
##         FALSE TRUE
##   FALSE   294    6
##   TRUE     37   18
# Compute accuracy
accu_CART <- (cmat_CART[1,1] + cmat_CART[2,2])/sum(cmat_CART) #(294+18)/(294+6+37+18)=0.8788732
#Ans:Overall accuracy=0.8788732
#Sensitivity = 18 / 55 = 0.3273 ( = TP rate)
#Specificity = 294 / 300 = 0.98
#FP rate = 6 / 300 = 0.02


#Comparison with the baseline accuracy
#Let's compare this to a simple baseline model that always predicts non-negative (i.e. the most common value of the dependent variable).
#To compute the accuracy of the baseline model, let's make a table of just the outcome variable Negative.
cmat_baseline<-table(testSparse$Negative)
cmat_baseline
## 
## FALSE  TRUE 
##   300    55
accu_baseline <- max(cmat_baseline)/sum(cmat_baseline)#300/(300+55)=0.8450704
#Ans:Baseline model accuracy=0.8450704

#So our CART model does better than the baseline model. Let's see how a Random Forest does.


#Random forest model

library(randomForest)
set.seed(123)


#Building the Random Forest model with all the IVs (takes a considerably long time since we have a large number of IVs)
#We use the randomForest() function to predict Negative again using all of our other variables as independent variables and the data set trainSparse. Again we use the default parameter settings:
tweetRF = randomForest(Negative ~ ., data=trainSparse)

# Make Out-of-Sample predictions:
predictRF = predict(tweetRF, newdata=testSparse)

#computing the confusion matrix
cmat_RF<-table(testSparse$Negative, predictRF)
cmat_RF
##        predictRF
##         FALSE TRUE
##   FALSE   293    7
##   TRUE     34   21
#Overall model Accuracy:
accu_RF <- (cmat_RF[1,1] + cmat_RF[2,2])/sum(cmat_RF)
accu_RF #(293+21)/(293+7+34+21)=0.884507
## [1] 0.884507
#The overall accuracy of this Random Forest model is 0.884507

#The accuracy is slightly better than that of the CART model, but the CART model is more interpretable than the Random Forest, so I would probably use the CART model.
#If you were to use cross-validation to pick the cp parameter for the CART model, the accuracy would increase to about the same as the random forest model. So by using a bag-of-words approach and these models, we can reasonably predict sentiment even with a relatively small data set of tweets.

##################################

#QUICK QUESTION

#Comparison with logistic regression model

#In the previous video, we used CART and Random Forest to predict sentiment. Let's see how well logistic regression does. Build a logistic regression model (using the training set) to predict "Negative" using all of the independent variables. You may get a warning message after building your model - don't worry (we explain what it means in the explanation).

#Build the model, using all independent variables as predictors:
tweetLog<- glm(Negative ~ . , data =trainSparse, family = binomial)
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
#summary(tweetLog)


#Now, make predictions on the testing set using the logistic regression model:
predictLog= predict(tweetLog, newdata=testSparse, type="response")
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type =
## ifelse(type == : prediction from a rank-deficient fit may be misleading
#where "tweetLog" should be the name of your logistic regression model. You might also get a warning message after this command, but don't worry - it is due to the same problem as the previous warning message.

#Build a confusion matrix (with a threshold of 0.5) and compute the accuracy of the model. What is the accuracy?

# Confusion matrix with threshold of 0.5
cmat_log<-table(testSparse$Negative, predictLog> 0.5)
cmat_log
##        
##         FALSE TRUE
##   FALSE   253   47
##   TRUE     22   33
#Let's now compute the overall accuracy
accu_log <- (cmat_log[1,1] + cmat_log[2,2])/sum(cmat_log)
accu_log #(253+33)/(253+47+22+33) = 0.8056338
## [1] 0.8056338
#Ans:0.8056338
#EXPLANATION:The accuracy is (253+33)/(253+47+22+33) = 0.8056338, which is worse than the baseline.

#The Perils of Over-fitting:
#If you were to compute the accuracy on the training set instead, you would see that the model does really well on the training set - this is an example of over-fitting. The model fits the training set really well, but does not perform well on the test set. A logistic regression model with a large number of variables is particularly at risk for overfitting.

#Note that you might have gotten a different answer than us, because the glm function struggles with this many variables. The warning messages that you might have seen in this problem have to do with the number of variables, and the fact that the model is overfitting to the training set. We'll discuss this in more detail in the Homework Assignment.

THE ANALYTICS EDGE

  • Analytical sentiment analysis can replace more labor-intensive methods like polling.
  • Text analytics can deal with the massive amounts of unstructured data being generated on the internet.
  • Computers are becoming more and more capable of interacting with humans and performing human tasks.

Man vs Machine_How IBM Built a Jeopardy Champion_2

INTRODUCTION

How IBM Built a Jeopardy! Champion

A Grand Challenge

  • In 2004, IBM Vice President Charles Lickel and coworkers were having dinner at a restaurant
  • All of a sudden, the restaurant fell silent
  • Everyone was watching the game show Jeopardy! on the television in the bar
  • A contestant, Ken Jennings, was setting the record for the longest winning streak of all time (75 days)

Why was everyone so interested?

  • Jeopardy! is a quiz show that asks complex and clever questions (puns, obscure facts, uncommon words)
  • Originally aired in 1964
  • A huge variety of topics
  • Generally viewed as an impressive feat to do well
  • No computer system had ever been developed that could even come close to competing with humans on Jeopardy!

A Tradition of Challenges

  • IBM Research strives to push the limits of science
    • Have a tradition of inspiring and difficult challenges
  • Deep Blue - a computer to compete against the best human chess players
    • A task that people thought was restricted to human intelligence
  • Blue Gene - a computer to map the human genome
    • A challenge for computer speed and performance

The Challenge Begins

  • In 2005, a team at IBM Research started creating a computer that could compete at Jeopardy!
    • No one knew how to beat humans, or if it was even possible
  • Six years later, a two-game exhibition match aired on television
    • The winner would receive $1,000,000

The Contestants


  • Ken Jennings
    • Longest winning streak of 75 days

  • Brad Rutter
    • Biggest money winner of over $3.5 million

  • Watson
    • A supercomputer with 3,000 processors and a database of 200 million pages of information

The Match Begins


VIDEO 1: IBM WATSON

QUICK QUESTION

What were the goals of IBM when they set out to build Watson? Select all that apply.

Ans:To build a computer that could compete with the best human players at Jeopardy!, and to build a computer that could answer questions that are commonly believed to require human intelligence.

EXPLANATION:The main goals of IBM were to build a computer that could answer questions that are commonly believed to require human intelligence, and to therefore compete with the best human players at Jeopardy!.

VIDEO 2: THE GAME OF JEOPARDY

Overview of the Jeopardy! game


  • Three rounds per game
    • Jeopardy
    • Double Jeopardy (dollar values doubled)
    • Final Jeopardy (wager on response to one question)
  • Each round has five questions in six categories
    • Wide variety of topics (over 2,500 different categories)
  • Each question has a dollar value - the first to buzz in and answer correctly wins the money
    • If they answer incorrectly they lose the money

Example Round

[Figure: example Jeopardy! round]

Jeopardy! Questions

  • Cryptic definitions of categories and clues
  • Answer in the form of a question
    • Q: Mozart’s last and perhaps most powerful symphony shares its name with this planet.
    • A: What is Jupiter?
    • Q: Smaller than only Greenland, it’s the world’s second largest island.
    • A: What is New Guinea?

QUICK QUESTION

For which of the following reasons is Jeopardy! challenging? Select all that apply.

Ans:A wide variety of categories; speed is required - you have to buzz in faster than your competitors; and the categories and clues are often cryptic.

EXPLANATION:Jeopardy! is challenging because there are a wide variety of categories, speed is required, and the categories and clues are cryptic. Expert knowledge is not generally required.

VIDEO 3: WATSON’S DATABASE AND TOOLS

Why is Jeopardy Hard?

  • Wide variety of categories, purposely made cryptic
  • Computers can easily answer precise questions
    • What is the square root of (35672-183)/33?
  • Understanding natural language is hard
    • Where was Albert Einstein born?
    • Suppose you have the following information:
    “One day, from his city views of Ulm, Otto chose a water color to send to Albert Einstein as a remembrance of his birthplace.”
    • Ulm? Otto?
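To make the contrast concrete, the precise question in the list above really is a one-line computation for a computer:

```python
import math

# The "precise question" from the slide: the square root of (35672 - 183) / 33.
value = math.sqrt((35672 - 183) / 33)
print(round(value, 2))
```

The natural-language Einstein clue, by contrast, requires first resolving who "Otto" is and what "his birthplace" refers to before any lookup can even begin.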

A Straightforward Approach

  • Let’s just store answers to all possible questions
  • This would be impossible
    • An analysis of 200,000 previous questions yielded over 2,500 different categories
  • Let’s just search Google
    • No links to the outside world permitted
    • It can take considerable skill to find the right webpage with the right information

Using Analytics

  • Watson received each question in text form
    • Normally, players see and hear the questions
  • IBM used analytics to make Watson a competitive player
  • Used over 100 different techniques for analyzing natural language, finding hypotheses, and ranking hypotheses

Watson’s Database and Tools

  • A massive number of data sources
    • Encyclopedias, texts, manuals, magazines, Wikipedia, etc.
  • Lexicon
    • Describes the relationship between different words
    • Ex: “Water” is a “clear liquid” but not all “clear liquids” are “water”
  • Part of speech tagger and parser
    • Identifies functions of words in text
    • Ex: “Race” can be a verb or a noun
      • He won the race by 10 seconds.
      • Please indicate your race.

How Watson Works

  • Step 1: Question Analysis
    • Figure out what the question is looking for
  • Step 2: Hypothesis Generation
    • Search information sources for possible answers
  • Step 3: Scoring Hypotheses
    • Compute confidence levels for each answer
  • Step 4: Final Ranking
    • Look for a highly supported answer
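The four steps above can be sketched as a single pipeline. Everything below is a toy stand-in (the detected LAT, the candidate list, and the scores are all hard-coded), meant only to show how the stages feed into each other:

```python
BUZZ_THRESHOLD = 0.5  # hypothetical confidence cutoff

def analyze_question(question):          # Step 1: Question Analysis
    return "planet"                      # pretend we detected the LAT

def generate_candidates(lat):            # Step 2: Hypothesis Generation
    return ["Mercury", "Jupiter", "Earth"]

def score_hypothesis(candidate):         # Step 3: Scoring Hypotheses
    return {"Jupiter": 0.9}.get(candidate, 0.1)  # canned evidence scores

def answer_question(question):           # Step 4: Final Ranking
    lat = analyze_question(question)
    scored = [(c, score_hypothesis(c)) for c in generate_candidates(lat)]
    best, conf = max(scored, key=lambda cs: cs[1])
    return best if conf > BUZZ_THRESHOLD else None  # only buzz if confident

print(answer_question("Mozart's last and perhaps most powerful symphony "
                      "shares its name with this planet."))
```

The real system replaces each stub with over 100 analysis techniques, but the control flow is the same: analyze, generate, score, rank.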

QUICK QUESTION

Which of the following two questions do you think would be EASIEST for a computer to answer?

Ans:What year was Abraham Lincoln born?

EXPLANATION:The second question would be the easiest, because the answer is a fact. The first question is much more subjective.

VIDEO 4: HOW WATSON WORKS - STEPS 1 AND 2

Step 1: Question Analysis

  • What is the question looking for?
  • Trying to find the Lexical Answer Type (LAT) of the question
    • Word or noun in the question that specifies the type of answer
  • Ex: “Mozart’s last and perhaps most powerful symphony shares its name with this planet.”
  • Ex: “Smaller than only Greenland, it’s the world’s second largest island.”

Step 1: Question Analysis

  • If we know the LAT, we know what to look for
  • In an analysis of 20,000 questions
    • 2,500 distinct LATs were found
    • 12% of the questions do not have an explicit LAT
    • The most frequent 200 explicit LATs cover less than 50% of the questions
  • Also performs relation detection to find relationships among words, and decomposition to split the question into different clues
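For simple clues like the two examples, the LAT is the noun phrase following “this”. A very naive heuristic along those lines can be sketched with a regular expression; Watson’s real question analysis uses a full parser and many more signals, so this is illustration only:

```python
import re

def find_lat(question):
    """Naive LAT heuristic: take the (up to two-word) phrase that follows
    "this" or "these". Watson's actual question analysis is far more
    sophisticated; this regex only works on simple clues."""
    m = re.search(r"\bth(?:is|ese)\s+([a-z]+(?:\s+[a-z]+)?)", question.lower())
    return m.group(1) if m else None

print(find_lat("Mozart's last and perhaps most powerful symphony "
               "shares its name with this planet."))
print(find_lat("NICHOLAS II WAS THE LAST RULING CZAR OF THIS ROYAL FAMILY"))
```

Note that such a heuristic fails on the 12% of questions with no explicit LAT, which is part of why so many different techniques are combined.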

Step 2: Hypothesis Generation

  • Uses the question analysis from Step 1 to produce candidate answers by searching the databases
  • Several hundred candidate answers are generated
  • Ex: “Mozart’s last and perhaps most powerful symphony shares its name with this planet.”
    • Candidate answers: Mercury, Earth, Jupiter, etc.
  • Each candidate answer, plugged back into the question in place of the LAT, is then considered a hypothesis
    • Hypothesis 1: “Mozart’s last and perhaps most powerful symphony shares its name with Mercury.”
    • Hypothesis 2: “Mozart’s last and perhaps most powerful symphony shares its name with Jupiter.”
    • Hypothesis 3: “Mozart’s last and perhaps most powerful symphony shares its name with Earth.”
  • If the correct answer is not generated at this stage, Watson has no hope of getting the question right
  • This step errs on the side of generating a lot of hypotheses, and leaves it up to the next step to find the correct answer
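The substitution itself is mechanical once the LAT is known. A minimal sketch (the candidate list is hand-supplied here, whereas Watson generates several hundred by searching its sources):

```python
def substitute_candidates(question, lat, candidates):
    """Turn each candidate answer into a hypothesis by substituting it
    for the LAT phrase in the question text."""
    return [question.replace("this " + lat, c) for c in candidates]

question = ("Mozart's last and perhaps most powerful symphony "
            "shares its name with this planet.")
hypotheses = substitute_candidates(question, "planet",
                                   ["Mercury", "Jupiter", "Earth"])
print(hypotheses[1])
```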

QUICK QUESTION

Select the LAT of the following Jeopardy question: NICHOLAS II WAS THE LAST RULING CZAR OF THIS ROYAL FAMILY (Hint: The answer is “The Romanovs”)

Ans:THIS ROYAL FAMILY

Select the LAT of the following Jeopardy question: REGARDING THIS DEVICE, ARCHIMEDES SAID, “GIVE ME A PLACE TO STAND ON, AND I WILL MOVE THE EARTH” (Hint: The answer is “A lever”)

Ans: THIS DEVICE

EXPLANATION:The LAT in the first question is “THIS ROYAL FAMILY” and the LAT in the second question is “THIS DEVICE”. Remember that if you replace the LAT with the correct answer, the sentence should make sense.

VIDEO 5: HOW WATSON WORKS - STEPS 3 AND 4

Step 3: Scoring Hypotheses

  • Compute confidence levels for each possible answer
    • Need to accurately estimate the probability of a proposed answer being correct
    • Watson will only buzz in if a confidence level is above a threshold
  • Combines a large number of different methods

Lightweight Scoring Algorithms

  • Starts with “lightweight scoring algorithms” to prune down large set of hypotheses
  • Ex: What is the likelihood that a candidate answer is an instance of the LAT?
    • If this likelihood is not very high, throw away the hypothesis
  • Candidate answers that pass this step proceed to the next stage
    • Watson lets about 100 candidates pass into the next stage
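The pruning idea can be sketched as a threshold-plus-top-k filter. The likelihood values and cutoffs below are invented for illustration; the point is only that a cheap test discards most hypotheses before expensive scoring:

```python
def lightweight_prune(type_likelihood, threshold=0.5, keep=100):
    """Cheap first filter: drop candidates unlikely to be an instance of
    the LAT, then pass at most `keep` of the rest on to the expensive
    scoring analytics. Scores and cutoffs here are made up."""
    survivors = sorted(
        (c for c, p in type_likelihood.items() if p >= threshold),
        key=lambda c: type_likelihood[c],
        reverse=True,
    )
    return survivors[:keep]

# P(candidate is an instance of the LAT "planet") -- invented numbers
likelihoods = {"Jupiter": 0.95, "Mercury": 0.90, "Wolfgang": 0.05}
print(lightweight_prune(likelihoods))
```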

Scoring Analytics

  • Need to gather supporting evidence for each candidate answer
  • Passage Search
    • Retrieve passages that contain the hypothesis text
    • Let’s see what happens when we search for our hypotheses on Google
    • Hypothesis 1: “Mozart’s last and perhaps most powerful symphony shares its name with Mercury.”
    • Hypothesis 2: “Mozart’s last and perhaps most powerful symphony shares its name with Jupiter.”

Passage Search

Passage_Search_diff

Scoring Analytics

  • Determine the degree of certainty that the evidence supports the candidate answers
  • More than 50 different scoring components
  • Ex: Temporal relationships
    • “In 1594, he took a job as a tax collector in Andalusia”
    • Two candidate answers: Thoreau and Cervantes
    • Thoreau was not born until 1817, so we are more confident about Cervantes
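The temporal-relationship idea from the example can be sketched as one tiny scoring component. The binary scheme below is a simplification of Watson’s actual scorer; the lifespans are real:

```python
def temporal_score(evidence_year, lifespan):
    """Score 1.0 when the evidence date falls within the candidate's
    lifetime, 0.0 otherwise (a deliberate simplification)."""
    born, died = lifespan
    return 1.0 if born <= evidence_year <= died else 0.0

lifespans = {"Cervantes": (1547, 1616), "Thoreau": (1817, 1862)}
# Evidence: "In 1594, he took a job as a tax collector in Andalusia"
for name, span in lifespans.items():
    print(name, temporal_score(1594, span))
```

Thoreau is eliminated because he was not alive in 1594, so the evidence boosts confidence in Cervantes.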

Step 4: Final Merging and Ranking

  • Selecting the single best supported hypothesis
  • First need to merge similar answers
    • Multiple candidate answers may be equivalent
      • Ex: “Abraham Lincoln” and “Honest Abe”
    • Combine scores
  • Rank the hypotheses and estimate confidence
    • Use predictive analytics
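Merging equivalent answers before ranking can be sketched with an alias table. Summing the scores is an arbitrary choice for illustration; Watson’s actual combination rule is more sophisticated:

```python
def merge_equivalent(scores, aliases):
    """Fold the scores of answers that refer to the same entity into
    one canonical answer (e.g. "Honest Abe" -> "Abraham Lincoln")."""
    merged = {}
    for answer, score in scores.items():
        canonical = aliases.get(answer, answer)
        merged[canonical] = merged.get(canonical, 0.0) + score
    return merged

raw = {"Abraham Lincoln": 0.4, "Honest Abe": 0.3, "Jefferson Davis": 0.2}
merged = merge_equivalent(raw, {"Honest Abe": "Abraham Lincoln"})
print(merged)
```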

Ranking and Confidence Estimation

  • Training data is a set of historical Jeopardy! questions
  • Each of the scoring algorithms is an independent variable
  • Use logistic regression to predict whether or not a candidate answer is correct, using the scores
  • If the confidence for the best answer is high enough, Watson buzzes in to answer the question
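The logistic-regression step can be sketched directly: each scoring algorithm contributes one feature, and the learned weights combine them into a single probability. The weights, scores, bias, and buzz threshold below are all invented for illustration:

```python
import math

def confidence(feature_scores, weights, bias):
    """Logistic model over the scoring components: sigmoid of a
    weighted sum. In Watson, the weights are learned from historical
    Jeopardy! questions; these numbers are made up."""
    z = bias + sum(w * x for w, x in zip(weights, feature_scores))
    return 1.0 / (1.0 + math.exp(-z))

weights = [2.0, 1.5, 0.5]      # one learned weight per scoring component
top_answer = [0.9, 0.8, 0.7]   # component scores for the best candidate
conf = confidence(top_answer, weights, bias=-2.0)

BUZZ_THRESHOLD = 0.6           # hypothetical cutoff
print(conf > BUZZ_THRESHOLD)   # buzz in only when confident enough
```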

The Watson System